Ideas and exercises come from https://r4ds.had.co.nz/transform.html
Additional notes by TCS
First, we load the tidyverse and the nycflights13 package, whose flights
data frame contains all 336,776 flights that departed from New York City
in 2013. The Lahman baseball dataset is also used, as is hms for
time-of-day handling.
require(nycflights13)
require(tidyverse)
require(hms)
require(Lahman)
flights
There are 5 key functions (“verbs”), plus a helper function, that do most data manipulation tasks in dplyr:
* filter(): pick observations by values
* arrange(): reorder rows
* select(): pick variables by name
* mutate(): create new variables from existing ones, using functions
* summarise(): collapse values into single ones
* group_by(): change scope of a verb from the whole dataset to individual groups
filter() gives you a subset of rows based on values. filter() includes ONLY rows where the condition is TRUE;
it excludes both FALSE and NA values. If you want to preserve missing
values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
(biggerthan1 <- filter(df, x > 1))
(bigorNA <- filter(df, is.na(x) | x > 1))
arrange() is for sorting rows. The arguments to arrange() are a data frame and column
name(s). The desc() option reverses the order.
select() gives you a subset of columns by name. You can name each column, or specify a range using a colon.
As with other R selections, you can omit certain columns using the minus
sign. You can add multiple arguments to include more columns in the
selection.
select() does not have to use exact column matches; there are matching helpers:
* starts_with("foo")
* ends_with("bar")
* contains("foobar")
* matches(some_regex)
* num_range("x", 1:3) matches x1, x2 and x3
rename() is considered a variant of select() where you
take a column, change its name, and keep all other columns as well.
everything() is a helper for select() that lets you
move one or a few columns to the beginning (left) of the table, while
retaining all other columns.
mutate() is for creating new variables from old. transmute() is a variant where you keep ONLY the newly
defined columns.
(flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
))
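The matching helpers are, in effect, filters on column names. A base-R sketch of what they would pick (illustration only with a made-up name vector, not dplyr's actual implementation):

```r
cols <- c("dep_time", "dep_delay", "arr_time", "arr_delay", "x1", "x2", "x3")
cols[startsWith(cols, "dep")]       # what starts_with("dep") selects
cols[endsWith(cols, "delay")]       # what ends_with("delay") selects
cols[grepl("time", cols)]           # what contains("time") selects
cols[cols %in% paste0("x", 1:3)]    # what num_range("x", 1:3) selects
```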
(flights_less_sml <- mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
))
(flights_even_more <- mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
(flights_new <- transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
Useful functions to pair with mutate(). Some examples:
* Arithmetic, logs (vectors recycle to match length)
* Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y)
* Surrounding values: lead(), lag()
* Cumulative values such as cumsum() and cumprod()
* With the RcppRoll package, rolling sums, etc.
* Ranking, such as min_rank()
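The modulus identity above can be checked directly in base R (a plain sketch, no packages), and it is exactly the trick used later to split HHMM times:

```r
# Check the identity x == y * (x %/% y) + (x %% y) for an HHMM-style time.
x <- 517   # 5:17 stored as the integer 517
y <- 100
x %/% y    # integer division -> 5 (the hours)
x %% y     # remainder        -> 17 (the minutes)
stopifnot(x == y * (x %/% y) + (x %% y))
```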
mutate() exercises: Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
*The times are in HHMM format, so straight subtraction does not work. We can convert to time objects, or use modulus arithmetic to convert to minutes and back to time.
We expect dep_delay = dep_time - sched_dep_time. However, for differences of more than 60 minutes, we have the same HHMM time issue, and the same solution, as in exercise 2.
rank() goes smallest to largest, gives decimals
(averages) for ties (1, 2.5, 2.5, 4, …), and puts NAs last by default, but
you can change that behavior using na.last.
rank(desc()) goes largest to smallest. rank() has different methods of handling ties,
producing all-integer output: first, last, random, max (max rank of all
ties, so there might not be a #1), and min (min rank of all tied elements “as
used in sports,” so you can “tie for 1st” – (1, 1, 3, 4, …)) – this is
min_rank(). The first, last, and random methods
produce all-unique ranks. Instead of the rank() functions, you could use
arrange() and then row numbers.
The 1:3 vector is recycled, so you get 4+1, 5+2, etc.
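The tie-handling behaviors are easiest to see on a small vector in base R (a sketch, no packages needed):

```r
v <- c(10, 20, 20, 30)
rank(v)                        # 1.0 2.5 2.5 4.0 - ties averaged, decimals appear
rank(v, ties.method = "min")   # 1 2 2 4 - "sports" ranking, like dplyr's min_rank()
rank(v, ties.method = "max")   # 1 3 3 4 - ties share the max rank, no #1 here... wait, no #2
rank(-v)                       # negate to rank largest-first (dplyr: rank(desc(v)))
```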
The usual trigonometric functions are available, plus a few others. Angles are in radians.
cospi(x) and kin (sinpi(), tanpi()) give the functions of pi times x, and return
exact values when x is a multiple of 0.5.
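The payoff is accuracy at those half-integer multiples, which a quick base-R check shows:

```r
# cospi(x) computes cos(pi * x); at x = 0.5 it is exactly 0,
# while cos(pi * 0.5) leaves floating-point dust (~6e-17).
cospi(0.5)     # exactly 0
cos(pi * 0.5)  # tiny but nonzero
stopifnot(cospi(0.5) == 0, cos(pi * 0.5) != 0, sinpi(1) == 0)
```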
# The time is in HMM or HHMM format. Therefore to get the hours we divide by 100, and for minutes we take the remainder. (Did not deal with the times = 2400!)
(better_time <- mutate(flights,
  dep_min_since_midnight = (dep_time %/% 100) * 60 + dep_time %% 100,
  sched_dep_min_since_midnight = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100
))
(compare <- select(flights, air_time, arr_time, dep_time) %>%
  mutate(spent_time = arr_time - dep_time))
# the times are in HHMM format, so straight subtraction does not work
(realtimes <- mutate(flights,
arr_time_hms = hms(NULL, arr_time %% 100, arr_time %/% 100),
dep_time_hms = hms(NULL, dep_time %% 100, dep_time %/% 100),
spent_time_hms = difftime(arr_time_hms, dep_time_hms, units = "mins")))
# mutate to add the ranking function (alternatively you could probably arrange and then take by row number)
# if you preserve or average ties, you could end up with more or fewer than 10 in the result. Therefore to get exactly 10 we will break ties randomly.
(arr_ranked <- mutate(flights, arr_delay_rank = rank(desc(arr_delay), ties.method = "random")))
(top_10_delay <- arr_ranked %>% filter(arr_delay_rank <= 10))
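The tie-breaking idea can be sketched in base R: rank the negated values so the largest gets rank 1, with ties broken at random so a fixed cutoff returns exactly that many rows (toy vector and a cutoff of 3 for illustration):

```r
set.seed(1)                            # make the random tie-break reproducible
x <- c(5, 9, 9, 2, 7, 9)
r <- rank(-x, ties.method = "random")  # largest value -> rank 1, tied 9s randomized
top3 <- which(r <= 3)                  # exactly 3 indices, even with ties
x[top3]                                # the three 9s
```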
summarise() collapses groups into single values.
*When using summarise() you usually also want groups
created by group_by(). (If you didn’t have a group, you
could just call the desired functions on whole columns.)
*You can group by multiple levels.
*If you need to remove a grouping, use ungroup().
*Using the pipe we can quickly feed groups into summaries and do other useful tasks efficiently.
*Among tidyverse functions, only ggplot2 doesn’t work as well with pipes: “it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.” But ggplot2 is used in some examples below.
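The group-then-summarise idea itself is older than dplyr; a base-R sketch of what group_by() + summarise() streamline, using tapply on a toy data frame:

```r
# One summary value per group: mean of x within each level of g.
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 10))
(means <- tapply(df$x, df$g, mean))   # a: 2, b: 10
```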
# with pipe
(delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL"))
*Beware – if you have any missing values (NAs) in your data, your
summary value will also be NA, unless you set
na.rm = TRUE.
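A minimal base-R illustration of the NA trap and the fix (toy vector of delays):

```r
delays <- c(-5, 10, 30, NA)
mean(delays)                 # NA - one missing value poisons the summary
mean(delays, na.rm = TRUE)   # ~11.67 - mean of the non-missing values only
```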
mean(arr_delay[arr_delay > 0]) gives the mean of only those values > 0.
# plot delay vs distance
(p <- ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
)
# remove NAs when computing the summary
(delayed <- flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE)))
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
# remove NAs by removing data at the beginning
(not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
)
(true_delayed <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE)))
# which planes have the greatest delays?
(delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
))
# histogram of average delay times
(p1 <- ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10))
# remove NAs and cancelled
(delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
))
# scatterplot of average delay vs n (rotated histogram actually, but you can see more points)
# shows that the high averages are almost all based on few data points
(p3 <- ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10))
# filter out the points with low n before plotting
(p4 <- delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10))
# instead of averaging early and late flights together, start with the logical subset of the positive (late) delays
(not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
))
# looking at standard deviation of distance to each destination
(spread_of_origins <- not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd)))
# When do the first and last flights leave each day?
(first_and_last <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
))
(first_last_2 <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time), # This works ONLY because the table is already ordered by dep_time!
last_dep = last(dep_time)
))
# another way to find the 1st and last per day
(rankfilt <- not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))) # range gives you the min and max values
You can group hierarchically and see summaries within summaries. You can ungroup if needed.
# group multiple levels
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
# ungroup
(daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n())) # all flights
Here we continue digressing (IMO) on the relationship between variation and sample size, using examples from the Lahman baseball statistics package.
# Convert to a tibble so it prints nicely
(batting <- as_tibble(Lahman::Batting))
# batting average vs at bats per player
# visualize avg with the number of at-bats
(batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
))
(p <- batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE))
# who are the "best" batters? The lucky ones? Note the at-bats for the top of the list
(high_avg <- batters %>%
arrange(desc(ba))
)
Now we get back to the flight data and try methods of counting.
# Which destinations have the most carriers?
(not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>% # unique values
arrange(desc(carriers)))
# how many values are entered?
(valid_arr_time <- sum(!is.na(not_cancelled$arr_time)))
[1] 327346
# or to get the fraction of valid values
(valid_arr_time_frac <- mean(!is.na(not_cancelled$arr_time)))
[1] 1
(valid_arr_time_frac2 <- mean(!is.na(flights$arr_time)))
[1] 0.9741282
# count() is a dplyr summary of a grouped table
(flights_per_dest <- not_cancelled %>%
count(dest))
# you can weight counts by another variable
(aircraft_miles <- not_cancelled %>%
count(tailnum, wt = distance))
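A weighted count is just a grouped sum; a base-R sketch of the same idea on a toy table:

```r
# Same idea as count(tailnum, wt = distance): sum distance within each tailnum.
df <- data.frame(tailnum = c("N1", "N1", "N2"), distance = c(100, 200, 50))
tapply(df$distance, df$tailnum, sum)   # N1: 300, N2: 50
```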
Exercises:
Brainstorm ways to assess the typical delay characteristics of a group of flights (e.g., early and late equally often vs. consistently late vs. nearly always on time). To clarify these differences, use mean and standard deviation of delays; mean absolute value of delays; median delay.
Which is more important: arrival delay or departure delay?
To the passenger, arrival delay is most important. To the airport, departure delay is also important.
Come up with another approach that gives the same output as the examples above without using count(): use n().
Our definition of cancelled flights, (is.na(dep_delay) | is.na(arr_delay)), is slightly suboptimal. Why? Which is the most important column?
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
What does the sort argument to count() do? When might you use it?
# ---- assess lateness in different ways
# distinguish early + late vs constantly late vs. nearly always on time;
#mean and standard deviation of delays; mean absolute value of delays; median delay
# count the flights per destination without count()
(flights %>% group_by(dest) %>% summarise(n()))
# get aircraft miles: group by the tailnum and sum the miles
(not_cancelled %>% group_by(tailnum) %>% summarise(miles = sum(distance)))